Regularization is a technique, developed independently and in different forms in statistics and machine learning, for improving the predictive quality of a model. The idea is to make the model simpler than it would otherwise be by shrinking its coefficients toward zero, setting some of them exactly to zero, or some combination of the two. Regularization is enabled by default in sklearn's linear models.
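As a quick sketch of what this looks like in sklearn (using a small throwaway dataset, 'X_demo' and 'y_demo', invented here purely for illustration and not used in the rest of this notebook), the penalty type and strength are controlled by the 'penalty' and 'C' arguments, where smaller 'C' means stronger regularization:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# l2 (ridge-style) penalty shrinks coefficients toward zero
l2_model = LogisticRegression(penalty='l2', C=0.1).fit(X_demo, y_demo)

# l1 (lasso-style) penalty can set some coefficients exactly to zero;
# the liblinear solver supports it
l1_model = LogisticRegression(penalty='l1', C=0.1, solver='liblinear').fit(X_demo, y_demo)

print((l1_model.coef_ == 0).sum(), 'coefficients zeroed out by the l1 penalty')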
In [1]:
from IPython.core.pylabtools import figsize
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('bmh')
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
Let's generate some sample data: 100,000 observations and 50 features, of which only 5 are informative and 7 are redundant (linear combinations of the informative ones), split between 2 classes for classification.
In [2]:
X, y = make_classification(n_samples=100000,
                           n_features=50,
                           n_informative=5,
                           n_redundant=7,
                           n_classes=2,
                           random_state=2)
When building linear models, it's a good idea to standardize all of your predictors (mean zero, variance one).
In [3]:
figsize(12, 6)
plt.scatter(range(50), np.mean(X, axis=0));
In [4]:
scaler = StandardScaler()
X = scaler.fit_transform(X)
In [5]:
figsize(12, 6)
plt.scatter(range(50), np.mean(X, axis=0));
In [6]:
# nice normal looking predictors
figsize(18, 8)
ax = plt.subplot(441)
plt.hist(X[:, 0]);
ax = plt.subplot(442)
plt.hist(X[:, 1]);
ax = plt.subplot(443)
plt.hist(X[:, 2]);
ax = plt.subplot(444)
plt.hist(X[:, 3]);
In [7]:
# multicollinearity
correlations = np.corrcoef(X, rowvar=0)
corrpairs = {}
for i in range(50):
    for j in range(i+1, 50):
        if correlations[i, j] > 0.25:
            print(i, j, correlations[i, j])
            corrpairs[(i, j)] = correlations[i, j]
In [8]:
# plot is slow - 1 min or more
figsize(12, 18)
plt.subplot(311)
plt.scatter(X[:, 16], X[:, 37])
plt.subplot(312)
plt.scatter(X[:, 16], X[:, 18])
plt.subplot(313)
plt.scatter(X[:, 26], X[:, 43]);
Let's split the data into training and test sets so that model performance can be evaluated on held-out data.
In [9]:
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)
Next let's build a model using the default parameters and look at several different measures of performance.
In [10]:
default_model = LogisticRegression(random_state=0).fit(Xtr,ytr) # instantiate and fit
pred = default_model.predict(Xte) # make predictions
print('Accuracy: %s\n' % default_model.score(Xte, yte))
print(classification_report(yte, pred))
print('Confusion Matrix:\n\n %s\n' % confusion_matrix(yte, pred));
In the sklearn implementation, this default model is a regularized model, using $\ell_2$ regularization with $C = 1$. That is, the cost function to be minimized is $$-\frac{1}{n}\sum_{i=1}^n\left[y_i\log(p_i) + (1-y_i)\log(1 - p_i)\right]+\frac{1}{C}\cdot\sum_{j=1}^m w_j^2.$$ Here, $y_i$ is the $i^{th}$ response (target), $p_i$ is the model's predicted probability that $y_i = 1$, and the $w_j$ are the coefficients of the linear model. In a traditional statistical implementation, the second sum wouldn't be there, as it biases the model. That sum is the regularization.
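To make the two pieces of this cost concrete, here is a rough sketch that computes them by hand for the fitted default model. This is only illustrative; sklearn's internal objective differs by constant scaling factors, and 'log_loss' from sklearn.metrics is used here for the first sum.
from sklearn.metrics import log_loss

probs = default_model.predict_proba(Xte)[:, 1]                 # predicted probability that y_i = 1
data_fit = log_loss(yte, probs)                                # the first sum: average log loss
penalty = np.sum(default_model.coef_ ** 2) / default_model.C   # the second sum, scaled by 1/C
print('log loss: %s, scaled l2 penalty: %s' % (data_fit, penalty))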
There is no reason to believe that $C = 1$ is the ideal choice; it may be better to increase or decrease $C$. One way to search for better values is to do a grid search over a set of possible values for $C$, assessing the best choice using cross-validation.
In [11]:
cs = [10**(i+1) for i in range(2)] + [10**(-i) for i in range(5)] # create a list of C's
print(cs)
In [12]:
lm = LogisticRegression(random_state=0)
grid = GridSearchCV(estimator=lm,
                    param_grid=dict(C=cs),
                    scoring='accuracy',
                    verbose=1,
                    cv=5,
                    n_jobs=-1,  # parallelize over all cores
                    refit=True)  # instantiate the grid search (note model input)
grid.fit(Xtr, ytr) # fit
print("Best score: %s" % grid.best_score_)
print("Best choice of C: %s" % grid.best_estimator_.C)
In [13]:
# change the metric
grid_prec = GridSearchCV(estimator=lm,
                         param_grid=dict(C=cs),
                         scoring='precision',
                         verbose=1,
                         cv=5,
                         n_jobs=-1,  # parallelize over all cores
                         refit=True)  # instantiate the grid search (note model input)
grid_prec.fit(Xtr, ytr) # fit
print("Best score: %s" % grid_prec.best_score_)
print("Best choice of C: %s" % grid_prec.best_estimator_.C)
In [14]:
# change the metric
grid_auc = GridSearchCV(estimator=lm,
                        param_grid=dict(C=cs),
                        scoring='roc_auc',
                        verbose=1,
                        cv=5,
                        n_jobs=-1,  # parallelize over all cores
                        refit=True)  # instantiate the grid search (note model input)
grid_auc.fit(Xtr, ytr) # fit
print("Best score: %s" % grid_auc.best_score_)
print("Best choice of C: %s" % grid_auc.best_estimator_.C)
In [15]:
grid_preds = grid.predict(Xte)
print('Accuracy: %s\n' % accuracy_score(yte, grid_preds))
print(classification_report(yte, grid_preds))
print('Confusion Matrix:\n\n %s\n' % confusion_matrix(yte, grid_preds));
In [16]:
grid.best_estimator_.coef_
Out[16]:
In [17]:
figsize(12, 6)
plt.scatter(range(grid.best_estimator_.coef_.shape[1]),
            grid.best_estimator_.coef_)
plt.ylabel('value of coefficient')
plt.xlabel('predictor variable (index)');
Another way to do this is with the 'LogisticRegressionCV' estimator, a logistic regression built with tuning $C$ via cross-validation in mind. This time we'll set the penalty to $\ell_1$, let sklearn pick 10 candidate values of $C$ ('Cs=10'), use all cores on my machine ('n_jobs=-1'), and use the liblinear solver (the only one of the available solvers that can optimize with the $\ell_1$ penalty). The cost with the $\ell_1$ penalty is $$-\frac{1}{n}\sum_{i=1}^n\left[y_i\log(p_i) + (1-y_i)\log(1 - p_i)\right]+\frac{1}{C}\cdot\sum_{j=1}^m |w_j|.$$ This will take a minute or two to run.
In [18]:
cvmodel = LogisticRegressionCV(penalty='l1',
                               Cs=10,
                               n_jobs=-1,
                               verbose=1,
                               scoring='accuracy',
                               solver='liblinear')  # liblinear is the only solver for the l1 penalty
In [19]:
# takes about a minute
cv_fit = cvmodel.fit(Xtr,ytr)
In [20]:
cvmodel.C_
Out[20]:
In [21]:
cvmodel.coef_ # now all very small, most effectively 0
Out[21]:
In [22]:
plt.scatter(range(cvmodel.coef_.shape[1]), cvmodel.coef_[0])
plt.ylabel('value of coefficient')
plt.xlabel('predictor variable (index)')
plt.title('coefficients with l1 regularization');
In [23]:
cv_preds = cvmodel.predict(Xte)
print(accuracy_score(yte, cv_preds))
In [24]:
tuned_cv_scores = cross_val_score(cv_fit, X, y, scoring='accuracy', n_jobs=-1, verbose=2)
In [25]:
print(tuned_cv_scores)
print(np.mean(tuned_cv_scores))
In [26]:
default_cv_scores = cross_val_score(default_model, X, y, scoring='accuracy', n_jobs=-1, verbose=2)  # cross_val_score clones and refits the estimator itself
In [27]:
print(default_cv_scores)
print(np.mean(default_cv_scores))
In [28]:
fig, ax = plt.subplots(1,2, sharey=True, figsize=(16, 6))
ax[0].scatter(range(grid.best_estimator_.coef_.shape[1]),
              grid.best_estimator_.coef_)
ax[0].set_ylabel('value of coefficient')
ax[0].set_xlabel('predictor variable (index)')
ax[0].set_title('Coefficients with l2 Penalty')
ax[1].scatter(range(cvmodel.coef_.shape[1]), cvmodel.coef_[0])
ax[1].set_ylabel('value of coefficient')
ax[1].set_xlabel('predictor variable (index)')
ax[1].set_title('Coefficients with l1 Penalty');
In [29]:
# indices of coefficients the l1 penalty did not shrink to (effectively) zero
trivial = np.isclose(cvmodel.coef_, np.zeros(shape=cvmodel.coef_.shape)).flatten()
nontrivial = []
for i in range(len(trivial)):
    if not trivial[i]:
        nontrivial.append(i)
In [30]:
nontrivial
Out[30]:
In [31]:
final = LogisticRegression(C=cvmodel.C_[0], penalty='l1', solver='liblinear').fit(X[:, nontrivial], y)
In [32]:
# thanks StackOverflow!
# see http://stackoverflow.com/questions/36373266/change-in-running-behavior-of-sklearn-code-between-laptop-and-desktop/37259431
import warnings
warnings.filterwarnings("ignore")
final_cv_scores = cross_val_score(final, X[:, nontrivial], y, scoring='accuracy', n_jobs=-1)
In [33]:
print(final_cv_scores)
print(np.mean(final_cv_scores))
In [34]:
alt = cross_val_score(LogisticRegressionCV(penalty='l1', solver='liblinear', verbose=2, n_jobs=-1),
                      X[:, nontrivial], y, scoring='accuracy')
print(alt)
print(np.mean(alt))